1 Introduction

This final report describes work performed by Information Extraction & Transport (IET), Inc. on Challenge Problem Development and Evaluation Management for the Defense Advanced Research Projects Agency’s (DARPA’s) High Performance Knowledge Bases (HPKB) program. HPKB had the objective to develop innovative technologies supporting construction (by knowledge engineers) of knowledge bases, ontologies, and associated libraries of problem-solving strategies. IET was responsible for developing a crisis management (CM) challenge problem (CP) to focus and evaluate HPKB technology.

IET was supported by subcontractor Pacific-Sierra Research (now known as Veridian Systems). Together IET and Veridian Systems, with occasional consulting from the University of Massachusetts, constituted the evaluation team (ET) for the CM CP. The CM CP was posed primarily to HPKB’s two integration teams (ITs), known by the names of their lead contractors:

•  Teknowledge  Federal  Systems  (TFS);  and 

•  Science  Applications  International  Corporation  (SAIC). 

HPKB  spanned  three  funding  years  (U.S.  Government  Fiscal  Years  1997-1999),  but  only  two 
program — and  evaluation — years. 

•  Year 1 (Y1) ran from June, 1997 through July, 1998.¹

•  Year  2  (Y2)  ran  from  July,  1998  through  October,  1999. 

From the program’s outset, IET maintained a Web site disseminating all of its HPKB products (http://www.iet.com/Projects/HPKB). A visitor can find overview briefings, specification materials, evaluation materials, and evaluation results reports.

As a point of departure, Figure 1 depicts the CP development methodology IET created for HPKB.


Figure  1:  Challenge  problem  methodology 

CPs must merge needs of target applications with opportunities offered by technologies into a productive task intersection, tempered by practical customer (DARPA) and technology developer constraints. Hitting just the right level of difficulty requires a thorough understanding of the application domain, the technology, and the reasonably expected pace of technical development. It is important to set the bar high without making the jump impossible, to make the task feasible without trivializing it.

¹ The HPKB Y1 evaluation was featured in the Winter 1998 issue (Volume 19, Number 4) of AI Magazine: “The DARPA High-Performance Knowledge Bases Project,” by Paul Cohen, Robert Schrag, Eric Jones, Adam Pease, Albert Lin, Barbara Starr, David Gunning, and Murray Burke (pages 25-49).

2  Knowledge  representation  and  reasoning  requirements 

The CM CP’s application context was the support of intelligence analysts or their automated agents in interpreting international events. IET worked with Veridian’s subject matter experts to develop an outline of the tasks analysts typically run through when they are tasked by policy-/decision-makers. See the outline below.

I.  Information gathering

II.  Situation assessment

    A.  Explanation
        -  Capabilities, motives, intents, risks, rewards

    B.  Ramification
        -  Effects on actor interests

    C.  Context
        -  Interests, policies, ideologies, alliances, enmities

III.  Scenario development

    A.  Action option generation

    B.  Option evaluation

    C.  Likelihood rating²

Information  gathering  includes  tasks  corresponding  to  the  journalist’s  questions  “What 
happened?”,  “What  does  it  mean?”,  and  “What  might  happen  next?”.  Situation  assessment  (or 
interpretation)  includes  explanation  and  ramification  factors  pertaining  to  a  specific  situation  at 
hand  and  context  factors  contributing  to  a  “strategic  culture”  for  a  national  actor’s  behavior  in 
international  relations.  Scenario  development  (or  predictive  speculation)  starts  with  the 
generation  of  plausible  actions  for  each  crisis  actor.  Then  options  are  evaluated  with  respect  to 
the  same  factors  as  in  Part  II  (situation  assessment)  and  a  likelihood  rating  is  produced,  with  the 
most  plausible  actions  being  reported  back  to  the  policy  makers. 

In the tasks of intelligence analysis, there are some classical opportunities for the application of knowledge-based systems. These opportunities (presented as institutional “deficits”) are depicted in Figure 2.


² This is the only element of the analytic process outline that the ITs’ development of international political common sense was not asked to address in the CM CP.
[Figure 2 content: institutional deficits and their sources]

•  “Analysis Deficit”: smaller corps, larger workload

•  “Strategic Culture Deficit”: “new” actors and events in the international system

•  “Corporate Memory Deficit”: new analysts

•  “Creativity Deficit”: “old” analysts

•  Callouts: lessons learned (historical cases, analogies, counterfactuals); “out-of-the-box” thinking

Figure 2: Crisis reasoning objectives

Two main themes in Figure 2 are the use of knowledge bases to capture expertise and the use of AI-based search to generate analytical possibilities which otherwise might not be considered. The latter usage requires development of extensive common sense knowledge, or “analyst’s sense,” about the domain to rule out possibilities that make no sense from an analyst’s point of view. As IET reviewed these opportunities with DARPA’s Project Genoa leader Admiral John Poindexter, he asked us to concentrate in Y1 on how knowledge-based technology could help analysts break out of their “ruts” of routine analysis (“out-of-the-box” thinking) and in Y2 on how a knowledge-based corporate memory could aid analysts (in ways indicated in the callout bubble).

IET realized that to address these reasoning challenges, KBs must capture something akin to “international political common sense,” a notion that we depict schematically in Figure 3.

[Figure 3 content: overlapping ovals with the following labels]

•  Crisis analysis: I. What happened / Who did it? II. Why did it happen / What does it mean? III. What might happen next?

•  Crisis representation

•  Crisis corporate memory: precedents, historical analogies; events, episodes

Figure 3: International political common sense

The overlapping ovals in Figure 3 suggest how concepts pertaining to actors, actions, and interests interact. In this model, actions are motivated by interests but balanced by risks and rewards. Actions have impacts and require capabilities. Interests drive the formation of alliances, the exercise of influence, and the generation of tensions among actors. All of these fall against a backdrop of current and historical context.
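
The relationships among actors, actions, and interests can be made concrete as a small data model. The sketch below is illustrative only: the class names, field names, and the numeric risk and reward ratings are assumptions made for this example and are not taken from any HPKB knowledge base or ontology.

```python
from dataclasses import dataclass, field


@dataclass
class Interest:
    """Something an actor cares about (hypothetical representation)."""
    name: str


@dataclass
class Action:
    """An option open to an actor, with made-up 0-3 risk and reward ratings."""
    name: str
    required_capabilities: set
    motivating_interests: list
    risk: int
    reward: int


@dataclass
class Actor:
    name: str
    capabilities: set = field(default_factory=set)
    interests: list = field(default_factory=list)

    def can_perform(self, action):
        # Actions require capabilities.
        return action.required_capabilities <= self.capabilities

    def is_motivated(self, action):
        # Actions are motivated by interests but balanced by risks and rewards.
        shares_interest = any(
            interest in self.interests for interest in action.motivating_interests
        )
        return shares_interest and action.reward >= action.risk


# Illustrative instances loosely based on the Y1 scenario; the ratings are invented.
strait = Interest("control access to the Strait of Hormuz")
iran = Actor("Iran", capabilities={"anti-ship missiles"}, interests=[strait])
close_strait = Action(
    "close the Strait", {"anti-ship missiles"}, [strait], risk=3, reward=2
)
```

With these invented ratings, `iran.can_perform(close_strait)` is true while `iran.is_motivated(close_strait)` is false, because the assumed risk outweighs the assumed reward.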

3  CP  specification 

IET began each evaluation year by defining and refining a CM CP specification with the following major elements, some of which (those marked “*”) we take up in subsections. We refer readers to our HPKB Web pages for a more comprehensive treatment.

•  *Domain scenario (crisis storyline) and related historical incident descriptions

•  Source material (Web-based and custom-developed background)

•  Domain-specific conceptualizations (pre-formal ontologies/KBs)

•  Access to subject matter experts (SMEs)

•  Sample questions (SQs) and sample answers

•  *Parameterized questions (PQs) and supporting PQ grammar, encoding a combinatorially large space of possible test questions (TQs)

•  *TQ answer scoring criteria and score aggregation methods

3.1  Crisis  scenarios  and  related  historical  incidents 

IET, in partnership with Veridian Systems, developed Y1 and Y2 fictional crisis scenarios set in the Middle East. Where possible, real events and people were referred to in order to provide both realism and source availability. The exercise of crisis corporate memory in Y2 also required a body of related historical incidents, or “cases” (shown in Figure 5).

3.1.1  Y1  Scenario 

The  Y1  crisis,  which  takes  place  in  the  Persian  Gulf  region,  involves  hostilities  between  Saudi 
Arabia  and  Iran,  culminating  in  closure  of  the  Strait  of  Hormuz  to  international  shipping.  As 
seen  in  Figure  4,  the  Strait  of  Hormuz  forms  a  strategic  chokepoint,  less  than  40  miles  across, 
through  which  a  large  percentage  of  the  world’s  oil  flows.  The  Iranians  currently  have  missiles 
that  can  reach  the  Strait’s  shipping  channel  from  Iranian  soil — and  offshore  islands — in  less  than 
two minutes. Iran considers its ability to control access to the Strait a political, military, and economic tool. The US, along with Europe and Japan, considers access to the Gulf via the Strait of Hormuz a strategic imperative.

Figure 4: Persian Gulf region with Strait of Hormuz

In the Y1 scenario, Iran is vying for hegemonic status in the region, and is critical of Saudi Arabia for its pro-Western stance. Saudi Arabia is extremely wary of Iranian designs on the Gulf and pro-Iranian factions within its borders. The continued inability of the OPEC structure to control oil production exacerbates the situation. The scenario moves through four stages as the conflict escalates. It ends with Iran attacking several Saudi tankers and declaring the Strait of Hormuz closed to traffic. The result of these actions is a series of armed clashes among several regional powers and United States forces.


3.1.2  Y2  Scenario 


In  HPKB  Y2,  the  Persian  Gulf  remained  a  highly  topical  setting  for  a  scenario  in  light  of 
persistent  tension  and  competition  among  regional  actors  over  economic,  security,  and 
sociopolitical  interests,  the  proliferation  of  weapons  of  mass  destruction  (WMD),  and  ongoing 
US  and  Western  efforts  to  bolster  stability  and  ensure  the  uninterrupted  supply  of  critical  energy 
resources. The Y2 scenario used as its real, current situation ongoing tensions surrounding a dispute over the route of a proposed oil pipeline³, Iran’s economic difficulties, and Iran’s well-documented desire to weaken the regional role of Saudi Arabia while enhancing its own. The scenario ended with a fictional but plausible excursion from the real, historical situation.


³ See the map (Figure 5) showing the area of interest in Year 2, as well as the proposed pipeline routes.

Figure 5: Y2 scenario proposed pipeline routes and historical cases (orange)


3.2 PQs

PQs generalize the sample questions (SQs), giving ITs a well-defined, but largely un-gamable, target space. Figure 6 demonstrates the PQ notion schematically.

Figure 6: Schematic PQ

By following the arrows and reading the italicized terms in Figure 6, one may find a rendering of CM CP sample question SQ2: “Among states bordering on the Persian Gulf, which produces the most oil?”. In each of the boxes are terms which are “ontologically close” to corresponding terms in the sample question. By varying terms in this controlled way, IET is able to produce TQs semantically close to the given SQ, thus limiting practically the scope of knowledge required to be encoded by the ITs, and giving them more indication of what to expect in the actual evaluation period.
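
The controlled variation of terms can be sketched as template expansion. The example below is a minimal illustration, not the HPKB PQ grammar: the template string, the slot names, and every filler other than the SQ2 terms are hypothetical, chosen only to show how a PQ defines a product space of TQs around one SQ.

```python
from itertools import product
from math import prod

# Hypothetical template and slot fillers; only the first filler in each
# slot comes from SQ2, the rest are invented "ontologically close" terms.
TEMPLATE = "Among {group} bordering on {region}, which produces the most {commodity}?"
SLOTS = {
    "group": ["states", "OPEC members"],
    "region": ["the Persian Gulf", "the Red Sea"],
    "commodity": ["oil", "natural gas"],
}


def count_instantiations(slots):
    """Size of the syntactically valid question space: product of slot sizes."""
    return prod(len(fillers) for fillers in slots.values())


def generate_tqs(template, slots):
    """Yield every question obtained by choosing one filler per slot."""
    names = list(slots)
    for combo in product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


tqs = list(generate_tqs(TEMPLATE, SLOTS))
```

Here `tqs[0]` reproduces SQ2 and the remaining seven strings are its close variants. Some variants are syntactically valid but not sensible questions, which is why a real generator would also need semantic filtering against the source material.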

Figure 7 presents the Y1 space of “analytical” PQs (those driven by the tasks of intelligence analysis, vice simpler “KB-diagnostic” PQs intended merely to perform a diagnostic function on the KB).

Question category: International system (Actors, Actions, Interests)
    Sample question: What terrorist group favoring interests of Iran and/or supported by Iran exists within Saudi Arabia?
    Parameterized template: What <InternationalAgentType> {opposing, favoring} interests of <InternationalAgent2> [and/or supported by <InternationalAgent3>] exists within <InternationalAgent4>?
    Number possible: 3.5 billion

Question category: Intelligence analysis (Capabilities, Risks, Rewards)
    Sample question: What risks would Iran face in exposure of its supporting a terrorist group in Saudi Arabia?
    Parameterized template: What {risks, rewards} would <InternationalAgent> face/expect in <InternationalActionType>?
    Number possible: 1.4 billion

Question category: Scenario understanding (Events, Cause, Effect)
    Sample question: What is the number of dead caused by the terrorist attack on the oil port during Day 22?
    Parameterized template: What is the <ScenarioActionResult> caused by the <ScenarioAction> [during <ScenarioTimeInterval>]?
    Number possible: 34 million

Question category: Background (Economics, Politics, Military, History, Geography)
    Sample question: Has Iran ever sponsored a terrorist group performing a terrorist attack?
    Parameterized template: Has <InternationalAgent1> ever <InternationalActionType> [in <InternationalAgent2>]?
    Number possible: 38 million

    Sample question: What kinds of weapons of mass destruction is Iran believed to possess?
    Parameterized template: What [kinds of] <MilitaryHardwareType> does/is <Country> {possess, believed to possess, have under development}?
    Number possible: 17 thousand

Figure 7: Y1 analytical PQ summary

Figure 7’s color coding indicates SQ instantiations of PQ classes and grammar constructs. The combinatorial possibilities for generating syntactically valid TQs were large, but the number of acceptable TQs was limited by semantic constraints and by the published source materials.

Our PQ understanding heretofore articulated allows us to provide a more coherent description of the competencies that we were asking ITs to develop. These competencies are indicated in the sentence that may be read from the top-level bullets below (set in green italic font in the original), with the CP elements supporting each sentence fragment listed beneath it (in black normal font).


•  Reason in modes of intelligence analysis...

    -  Representative analytical PQs
    -  Crisis scenarios & related historical incidents

•  ...based on domain-specific conceptualizations...

    -  International political system
    -  Scenario- and case-involved transnational actors

•  ...using common sense.

    -  Representative KB-diagnostic PQs


3.3  TQ  answer  scoring  and  score  aggregation 

To accommodate the fact that evaluating TQ answers produced by KBs requires significant subjective judgment, IET devised a discrete 0-3 scoring scale, schematized below. This led to clearer-cut judgments than more wide-ranging scales.

0.  Completely off-target

1.  Mostly  off-target 

2.  Mostly  on-target 

3.  Completely  on-target 

Figure  8  explains  how  aggregate  scores  were  computed  from  raw  scores  assigned  for  a  given  TQ. 
Individual  scores  were  assigned  for  criteria  falling  into  four  categories:  representation,  answer, 
explanation,  and  source.  Each  category  includes  at  least  one  basic  and  zero  or  more  extra-credit 
scoring  criteria.  The  basic  criteria  are  used  to  determine  a  basic  score  (left-hand  side  of  Figure 
8). Accounting for the extra-credit criteria yields an “overall” score (right-hand side). Here, we account only for the extra-credit criterion “compositionality” (assuming, for example, that the other extra-credit criteria receive scores of 0).


[Figure 8 content: flow diagram; recoverable values summarized]

Basic scores (left-hand side)
    Raw criterion scores: Representation Formulation 2; Representation Definitions 3; Answer Correctness 3; Explanation Correctness 2; Source Recording 2
    Gated criterion scores: Representation Definitions 3 * 2/3 = 2; other criteria unchanged
    Criterion weights within Representation: Formulation 2/3; Definitions 1/3
    Weighted gated basic score: 2.11 of 3

Overall scores (right-hand side)
    Raw scores: Basic Representation 2; Representation Compositionality 3; Answer 3; Explanation 2; Source 2
    Gated scores: Representation Compositionality 3 * 2/3 = 2; other scores unchanged
    Weighted gated overall score: 2.56 of 5.81 (4.69 max)

Figure 8: Y2 TQ answer score calculation

Along the left-hand side of Figure 8 (basic scores), we have the following, starting from the top.

•  Scores for criteria (such as Definitions) that are ancillary to a category (here, Representation) are “gated”: that is, they are scaled by the fraction of a perfect score (3) earned by the category’s main criterion (here, Formulation). This prevents ancillary criteria from dominating aggregate TQ scores (and removes the possibly accompanying temptation toward gaming). Gating reduces the Definitions score from 3 to 2.

    o  The same thing occurs for the ancillary extra-credit criterion Compositionality at the top right.

•  Next (left, middle), criterion-specific weights are applied to these gated scores to achieve an overall score for the category. (The sum of basic score weights over criteria for a given category always equals 1.) Only the Representation category has multiple basic criteria.

    o  Compositionality is treated similarly, except that extra-credit criteria weights for a given category do not necessarily sum to 1.

•  Next (left, bottom), the category scores are weighted, and the weighted average is used as an overall basic score. (The maximum is 3.)

    o  The same weighted averaging is performed for the overall scoring (where the maximum possible and maximum observed scores were 5.81 and 4.69, respectively).
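
The gating and weighting steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are invented, the raw scores and the 2/3 and 1/3 Representation criterion weights are the ones in the Figure 8 example, and the equal category weights are a placeholder chosen for illustration rather than the weights IET used, so the final number does not reproduce the figure's aggregate.

```python
PERFECT = 3  # top of the discrete 0-3 scale


def gate(ancillary, main):
    """Scale an ancillary criterion score by the main criterion's share of a perfect score."""
    return ancillary * main / PERFECT


def category_score(gated_scores, weights):
    """Weighted sum of gated criterion scores (basic weights sum to 1)."""
    return sum(gated_scores[name] * weights[name] for name in gated_scores)


def weighted_average(category_scores, category_weights):
    """Weighted average of category scores."""
    total_weight = sum(category_weights.values())
    weighted_sum = sum(
        category_scores[name] * category_weights[name] for name in category_scores
    )
    return weighted_sum / total_weight


# Raw basic scores from the worked example.
formulation, definitions = 2, 3
gated_definitions = gate(definitions, formulation)  # 3 * 2/3

representation = category_score(
    {"formulation": formulation, "definitions": gated_definitions},
    {"formulation": 2 / 3, "definitions": 1 / 3},
)

categories = {
    "representation": representation,
    "answer": 3,
    "explanation": 2,
    "source": 2,
}
# ASSUMPTION: equal category weights, for illustration only.
equal_weights = {name: 1 for name in categories}
basic = weighted_average(categories, equal_weights)
```

Gating leaves a perfect ancillary score untouched only when the main criterion is also perfect, which is what keeps a well-documented but poorly formulated representation from scoring highly.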


4  Evaluation  procedures  and  results 


Figure  9  schematically  depicts  (not  to  a  natural  time  scale)  the  major  events  involved  in  the 
annual  evaluation  cycle. 

[Figure 9 content: timeline events]

•  Release SQs & PQs

•  Release TQ batch & score answers

•  Release “close” TQ batch & score

Figure 9: Annual specification-to-evaluation cycle

After the ET’s release of a CP spec, ITs would undertake KB and supporting technology development. The start of a year-culminating, end-to-end evaluation (which was only a couple of weeks during the early summer) was marked by the release of baseline TQ batches. ITs would have a limited amount of time (less than a day) to elicit their KBs’ responses to these batches. The ET would then score ITs’ answers, and ITs, using the ET’s scores as feedback, would repair (given a day or two) their KBs. This repair was in preparation for response to a subsequent batch of TQs that the ET had generated (on a one-for-one basis) from the same PQs that had yielded the TQs in the baseline batch. The ET would then score the repaired KBs’ TQ answers for the baseline batch (now referred to as a “repair” batch to distinguish the scores) and for the new, robustness-checking, or “close,” TQ batch. In all cases, KB testing was “hands-off”: no modifications to the KB were allowed during TQ answering.

Figure 10 shows basic scores, averaged over TQs for the batches administered during HPKB Y1 and Y2.⁴ All of the cross-IT scoring differences are statistically significant (based on a paired t-test over TQ answer scores) except for the final Y2 batch, TQF.
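
For readers unfamiliar with the paired t-test, the statistic can be computed from per-TQ score differences as sketched below. The function name and the score lists are hypothetical; the numbers are made up for illustration and are not HPKB results.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-question score differences (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Made-up basic scores for two teams on the same four TQs (not HPKB data).
team_a = [3, 2, 3, 1]
team_b = [2, 1, 2, 1]
t = paired_t_statistic(team_a, team_b)
```

The resulting statistic is compared against a t distribution with n - 1 degrees of freedom (3 here). The sketch raises a division error when every difference is identical, since the sample standard deviation is then zero.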



Figure 10: Y2 scores with re-aggregated Y1 scores

The batches noted in Figure 10 were related according to Table 1.

Year, phase      Baseline batch    Repair batch               Close batch
Y1, Phase 1      TQA               TQAp                       TQB
Y1, Phase 2      TQC               TQCp                       TQD
Y2               TQEbaseline       TQEmidpoint, TQEfinal      TQF

Table 1: Evaluation TQ batch relationships


The HPKB evaluations were run as a “friendly” competition. The graph in Figure 10 (and others like it available on IET’s HPKB Web pages) should be treated in this spirit. IET’s major contributions to HPKB were in terms of evaluation methodology, challenge problem development, and challenge problem administration.


⁴ Y1 basic scores were calculated using a slightly different method. In Figure 10, we have re-aggregated individual Y1 TQs’ scores with normalizations to support using the Y2 method.